Scheduling

[2411.16102] BlendServe: Optimizing Offline Inference for Auto-regressive Large Models with Resource-aware Batching

Offline inference workloads contain both compute-bound requests (long prefill) and memory-bound requests (long decode). This work continues the line of research on maximizing the overlap between compute and memory resources across a batch. Its novelty is solving the maximum-offline-throughput problem under prefix sharing: requests that share a prompt prefix should be batched together to reuse KV cache, which constrains how freely requests can be reordered for resource blending.
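The scheduling idea can be illustrated with a toy reordering heuristic. This is a hedged sketch, not BlendServe's actual algorithm: the `Request` fields, the compute-ratio proxy, and the two-pointer blending are all my own simplifications of "keep prefix groups together, then pair compute-heavy groups with memory-heavy ones so each batch overlaps both resources."

```python
from dataclasses import dataclass

@dataclass
class Request:
    prefix_id: str        # shared-prefix group (prefix sharing)
    prefill_tokens: int   # long prefill -> compute-bound
    decode_tokens: int    # long decode  -> memory-bandwidth-bound

def compute_ratio(r: Request) -> float:
    # crude proxy: fraction of the request's work that is compute-heavy prefill
    return r.prefill_tokens / (r.prefill_tokens + r.decode_tokens)

def blend_batches(requests, batch_size):
    """Keep same-prefix requests adjacent (so shared-prefix KV cache is
    reused), then drain groups alternately from the compute-bound end and
    the memory-bound end so each batch blends both resource profiles."""
    groups = {}
    for r in requests:
        groups.setdefault(r.prefix_id, []).append(r)
    # order prefix groups from most compute-bound to most memory-bound
    ordered = sorted(
        groups.values(),
        key=lambda g: sum(map(compute_ratio, g)) / len(g),
        reverse=True,
    )
    blended, lo, hi = [], 0, len(ordered) - 1
    while lo <= hi:
        blended.extend(ordered[lo]); lo += 1   # compute-heavy group
        if lo <= hi:
            blended.extend(ordered[hi]); hi -= 1  # memory-heavy group
    return [blended[i:i + batch_size] for i in range(0, len(blended), batch_size)]
```

Note the tension the paper targets: a pure resource-blending order would interleave arbitrarily, while prefix sharing wants same-prefix requests adjacent; the sketch resolves it by reordering at group granularity only.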


KV Cache

[2404.04793] SqueezeAttention: 2D Management of KV-Cache in LLM Inference via Layer-wise Optimal Budget

Reduces KV cache storage for selected layers: instead of a uniform per-layer cache, it assigns each layer an optimal cache budget based on that layer's importance.
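A minimal sketch of the layer-wise budgeting idea. The allocation rule below (proportional to an importance score, with a per-layer floor) is my own illustrative assumption, not SqueezeAttention's exact policy.

```python
def layerwise_kv_budget(importance, total_budget, min_budget=1):
    """Split one global KV-cache token budget across layers in proportion
    to a per-layer importance score, with a floor per layer.
    (If the floors alone exceed total_budget, the result overshoots.)"""
    total_imp = sum(importance)
    budgets = [max(min_budget, int(total_budget * s / total_imp))
               for s in importance]
    # hand rounding leftovers to the most important layers first
    leftover = total_budget - sum(budgets)
    for i in sorted(range(len(importance)),
                    key=lambda i: importance[i], reverse=True):
        if leftover <= 0:
            break
        budgets[i] += 1
        leftover -= 1
    return budgets
```

Less important layers then keep only their budgeted number of KV entries (e.g. the most recent or highest-scoring tokens), which is where the memory saving comes from.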

Long Context

[2411.17116] Star Attention: Efficient LLM Inference over Long Sequences

Modifies the attention computation to process long contexts block by block: context blocks are encoded with local attention (each block also attends to a leading "anchor" block), and the query then attends globally over the full cached context.
